In this document, we consider the chest X-ray image dataset: a collection of X-ray images of patients' chests, where each patient is either healthy or infected with pneumonia.
The dataset was sourced from Kaggle and is a collection of attributes describing around 5,856 images.
The ability to verify a patient's pneumonia status as soon as possible significantly increases the chances of survival.
Despite the advancements in modern medicine and medical technology, pneumonia remains one of the most persistent infections known to mankind. Though many patients infected with pneumonia recover, it can quickly become deadly if the patient is not attended to in a timely manner. Even for patients who recover, pneumonia may leave health impairments that linger for a long time.
We hope that our machine learning model will significantly reduce the number of patients admitted to a hospital due to a pneumonia infection and ultimately help avoid fatal outcomes. Our model will be used by health professionals, and since it deals with an infection that can prove deadly if not treated properly and on time, it must be as accurate and consistent as possible.
Our target accuracy percentage: 98%
Because of the dataset's size, we will use images only from the test folder: it is a smaller sub-dataset, which allows our code to finish much faster. This sub-dataset contains 624 observations: 234 normal images and 390 pneumonia images.
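A quick way to confirm these counts is to tally the files in each category folder. This is a minimal sketch, assuming the standard Kaggle layout with `NORMAL` and `PNEUMONIA` subfolders under `chest_xray/test`; the `count_images` helper is ours, not part of the dataset.

```python
import os

def count_images(root):
    """Return {category_folder: number of files} for each subfolder of root."""
    return {cat: len(os.listdir(os.path.join(root, cat)))
            for cat in sorted(os.listdir(root))}

# Hypothetical usage against the Kaggle layout:
# count_images(os.path.join(os.getcwd(), 'chest_xray', 'test'))
# would report 234 NORMAL and 390 PNEUMONIA files for this sub-dataset.
```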
# import necessary libraries
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import cv2
import os
# Add name of dataset to absolute path
DataDirectory = os.path.join(os.getcwd(), 'chest_xray', 'test')
We will need to create the categories for our images and the data structures where we will store our dataset beforehand.
# create categories for images
Category = ['Normal', 'Pneumonia']
Category = np.asarray(Category)
Name = []
Img_arr = []
Because the images are currently large, we will need to resize them before continuing. We then separate the images into their respective categories, each stored as an array of pixels.
# define new image size
IMG_SIZE = (224, 224)
# iterate over 'Normal' and 'Pneumonia' images
for category in Category:
    path = os.path.join(DataDirectory, category)
    # iterate over each image
    for Img in os.listdir(path):
        Name.append(Img[:3])  # first 3 characters of the filename identify the class
        img_raw = cv2.imread(os.path.join(path, Img), cv2.IMREAD_GRAYSCALE)  # read raw image
        img_resized = cv2.resize(img_raw, IMG_SIZE)  # resize image
        Img_arr.append(img_resized)  # append resized image to array
# convert list to numpy array
Img_arr = np.asarray(Img_arr)
Let's print out the number of observations we got from the dataset, plus some additional information.
# get number of images
n_samples = Img_arr.shape[0]
print("n_samples: {}".format(n_samples))
# get number of features
h = Img_arr.shape[1]
w = Img_arr.shape[2]
n_features= h*w
pixels = Img_arr.flatten().reshape(n_samples, n_features)
print("\nImage size after resizing: {} by {}".format(h, w))
print("n_classes: {}".format(len(Category)))
print("n_features: {}".format(pixels.shape[1]))
n_samples: 624 Image size after resizing: 224 by 224 n_classes: 2 n_features: 50176
We observe that there are 624 images in the dataset, each resized to 224px by 224px.
The images are split into two categories: Normal for healthy patients, and Pneumonia for those infected.
Lastly, if each pixel in an image represents a dimension of the data, then there are 50,176 dimensions in this dataset, which is next to impossible to visualize.
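The flattening used above is lossless: each 224×224 image becomes one row of 50,176 features, and reshaping recovers the picture exactly. A minimal sketch with a stand-in array (not an actual X-ray):

```python
import numpy as np

h, w = 224, 224
img = np.arange(h * w, dtype=np.uint16).reshape(h, w)  # stand-in for one grayscale X-ray

row = img.flatten()           # one observation: 224 * 224 = 50,176 features
assert row.shape == (50176,)

restored = row.reshape(h, w)  # exact inverse of the flatten
assert np.array_equal(restored, img)
```

This round trip is why the galleries and reconstructions below can always `reshape((h, w))` a row back into a displayable image.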
# Helper function to plot a gallery of portraits
def plot_gallery(images, name, h, w, n_row=3, n_col=6):
    plt.figure(figsize=(1.7 * n_col, 2.3 * n_row))
    plt.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35)
    # iterate over each image
    for i in range(n_row * n_col):
        plt.subplot(n_row, n_col, i + 1)
        plt.imshow(images[i].reshape((h, w)), cmap=plt.cm.gray)
        plt.title(name, size=12)  # one shared title for the whole gallery
        plt.xticks(())
        plt.yticks(())
Let's see if we can replicate the images from the arrays.
# show gallery of normal images
plot_gallery(pixels, Category[0], h, w) # defaults to showing a 3x6 grid of images
The above grid of images is composed of X-ray scans from healthy individuals.
# show gallery of pneumonia images
plot_gallery(pixels[234:], Category[1], h, w) # defaults to showing a 3x6 grid of images
The above grid of images is composed of X-ray scans of individuals who are infected with pneumonia.
We are able to read in the images, convert them into arrays, and reproduce them as images once more. Now let us begin our data analysis of the images.
Principal Component Analysis (PCA) is a technique that reduces the number of features in a dataset's observations, significantly reducing the computing power necessary to analyze it while retaining as much of the original information as possible.
An important concept when building a PCA model is explained variance. According to the article PCA Explained Variance Concepts with Python Examples, it is the "variability in a dataset that can be attributed to each individual component". In other words, it measures how well the PCA represents the original data.
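As a quick illustration of the concept on synthetic data (not the X-ray matrix): when nearly all the variability lies along one direction, the first component's explained variance ratio dominates, and the ratios always sum to 1.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# rank-1 signal in 3 dimensions plus a little isotropic noise
X = rng.normal(size=(500, 1)) @ np.array([[10.0, 1.0, 0.1]])
X += rng.normal(scale=0.01, size=X.shape)

ratios = PCA().fit(X).explained_variance_ratio_
print(ratios.round(4))
assert np.isclose(ratios.sum(), 1.0)
assert ratios[0] > 0.99  # the first component captures nearly everything
```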
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
pca = PCA().fit(pixels)
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of Components')
plt.ylabel('Explained Variance')
plt.show()
From the graph above, we observe diminishing returns after approximately 250 components: beyond that point, the explained variance does not change significantly. For now, we'll go a bit above that and use 340 components.
The following block of code will allow us to see the relationship between the explained variance and the number of components.
# function for producing graph of explained variance and number of components
def plot_explained_variance(pca):
    import plotly
    from plotly.graph_objs import Bar, Line
    from plotly.graph_objs import Scatter, Layout
    from plotly.graph_objs.scatter import Marker
    from plotly.graph_objs.layout import XAxis, YAxis
    plotly.offline.init_notebook_mode()
    explained_var = pca.explained_variance_ratio_
    cum_var_exp = np.cumsum(explained_var)
    plotly.offline.iplot({
        "data": [
            Bar(y=explained_var, name='individual explained variance'),
            Scatter(y=cum_var_exp, name='cumulative explained variance')
        ],
        "layout": Layout(xaxis=XAxis(title='Principal components'),
                         yaxis=YAxis(title='Explained variance ratio'))
    })
from sklearn.decomposition import PCA
# Set the number of components to consider
n_components = 340
# Fit the PCA model on the data
pca = PCA(n_components = n_components, svd_solver = 'full')
pca.fit(pixels)
# Compute the cumulative explained variance ratio
cumulative_var = np.cumsum(pca.explained_variance_ratio_)
# Set the desired amount of variance to capture
desired_var = 0.95
# Find the number of components needed to capture the desired amount of variance
n_components_needed = np.argmax(cumulative_var >= desired_var) + 1
try:
    assert (cumulative_var >= desired_var).any()
    print("Number of components needed to capture", desired_var, "of the variance:", n_components_needed)
    plot_explained_variance(pca)
except AssertionError:
    print("Desired variance not met")
Number of components needed to capture 0.95 of the variance: 244
We want to represent 95% of the original data, and only 244 components are needed! We will settle on this number, since anything more adds computation time and power without providing significant value.
Let's try to get the same variance with a smaller number of components.
# Set the number of components to consider
n_components = 243
# Fit the PCA model on the data
pca = PCA(n_components = n_components, svd_solver = 'full')
pca.fit(pixels)
# Compute the cumulative explained variance ratio
cumulative_var = np.cumsum(pca.explained_variance_ratio_)
# Set the desired amount of variance to capture
desired_var = 0.95
# Find the number of components needed to capture the desired amount of variance
n_components_needed = np.argmax(cumulative_var >= desired_var) + 1
try:
    assert (cumulative_var >= desired_var).any()
    print("Number of components needed to capture", desired_var, "of the variance:", n_components_needed)
    plot_explained_variance(pca)
except AssertionError:
    print("Desired variance not met")
Desired variance not met
Anything lower than that number of components does not reach the desired variance of 0.95. Therefore, we will use 244 as our value moving forward.
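Rather than searching by hand, scikit-learn can pick this number for us: passing a float between 0 and 1 as `n_components` (with `svd_solver='full'`) keeps exactly as many components as are needed to reach that explained variance. A small sketch on synthetic data, not the X-ray matrix:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 100))

# Ask PCA directly for enough components to explain 95% of the variance.
pca_95 = PCA(n_components=0.95, svd_solver='full').fit(X)

print(pca_95.n_components_)  # chosen automatically during fit
assert pca_95.explained_variance_ratio_.sum() >= 0.95
```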
# lets do some PCA of the features and go from 50176 features to 244 features
n_components = 244
print ("Extracting the top %d eigenfaces from %d faces" % (n_components, pixels.shape[0]))
pca = PCA(n_components = n_components, svd_solver = 'full',)
%time pca.fit(pixels.copy())
eigenfaces = pca.components_.reshape((n_components, h, w))
plot_gallery(eigenfaces,Category[0], h, w)
Extracting the top 244 eigenfaces from 624 faces CPU times: total: 58.2 s Wall time: 4.21 s
Though not exact, the images above partly capture the structure of the originals. Let's make a closer comparison between an image and its PCA reconstruction.
# function for reconstructing an image from its low dimensional representation
def reconstruct_image(trans_obj, org_features):
    low_rep = trans_obj.transform(org_features)
    rec_image = trans_obj.inverse_transform(low_rep)
    return low_rep, rec_image
# Range of pictures in dataset:
# 0 - 234 = normal
# 234 - 624 = pneumonia
idx_to_reconstruct = 0
# change number to change picture
X_idx = pixels[idx_to_reconstruct]
low_dimensional_representation, reconstructed_image = reconstruct_image(pca,X_idx.reshape(1, -1))
# original image
plt.subplot(1,2,1)
plt.imshow(X_idx.reshape((h, w)), cmap=plt.cm.gray)
plt.title('Original - Normal')
plt.grid(False)
# reconstructed image
plt.subplot(1,2,2)
plt.imshow(reconstructed_image.reshape((h, w)), cmap=plt.cm.gray)
plt.title('Reconstructed from Full PCA - Normal')
plt.grid(False)
The image above on the left is an X-ray chest scan of a healthy patient. We observe that the image on the right is a low-dimensional reconstruction of the original data.
# Range of pictures in dataset:
# 0 - 234 = normal
# 234 - 624 = pneumonia
idx_to_reconstruct = 234
# change number to change picture
X_idx = pixels[idx_to_reconstruct]
low_dimensional_representation, reconstructed_image = reconstruct_image(pca,X_idx.reshape(1, -1))
# original image
plt.subplot(1,2,1)
plt.imshow(X_idx.reshape((h, w)), cmap=plt.cm.gray)
plt.title('Original - Pneumonia')
plt.grid(False)
# reconstructed image
plt.subplot(1,2,2)
plt.imshow(reconstructed_image.reshape((h, w)), cmap=plt.cm.gray)
plt.title('Reconstructed from Full PCA - Pneumonia')
plt.grid(False)
The image above on the left is an X-ray chest scan of a patient with pneumonia. We observe that the image on the right is a low-dimensional reconstruction of the original data.
Assuming we want a variance of 0.95, let's see how many components are needed with a Randomized PCA model.
# Set the number of components to consider
n_components = 340
# Fit the PCA model on the data
rpca = PCA(n_components = n_components, svd_solver = 'randomized')
rpca.fit(pixels)
# Compute the cumulative explained variance ratio
cumulative_var = np.cumsum(rpca.explained_variance_ratio_)
# Set the desired amount of variance to capture
desired_var = 0.95
# Find the number of components needed to capture the desired amount of variance
n_components_needed = np.argmax(cumulative_var >= desired_var) + 1
try:
    assert (cumulative_var >= desired_var).any()
    print("Number of components needed to capture", desired_var, "of the variance:", n_components_needed)
    plot_explained_variance(rpca)  # plot the randomized PCA fit
except AssertionError:
    print("Desired variance not met")
Number of components needed to capture 0.95 of the variance: 244
It appears that the number of components is the same for both the PCA and RPCA models.
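This agreement is expected: the randomized solver approximates the same leading singular subspace as the exact SVD, typically to high accuracy. A minimal sketch on synthetic data with a decaying spectrum, comparing the leading explained variance ratios from both solvers:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# synthetic data whose column scales decay, giving a clearly ordered spectrum
X = rng.normal(size=(300, 50)) * np.geomspace(10.0, 0.1, 50)

full = PCA(n_components=5, svd_solver='full').fit(X)
rand = PCA(n_components=5, svd_solver='randomized', random_state=0).fit(X)

# The leading spectra should agree closely (components match up to sign).
diff = np.abs(full.explained_variance_ratio_ - rand.explained_variance_ratio_)
print(diff.max())
assert diff.max() < 1e-3
```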
n_components = 244
print ("Extracting the top %d eigenfaces from %d faces" % (n_components, pixels.shape[0]))
rpca = PCA(n_components=n_components, svd_solver='randomized')
%time rpca.fit(pixels.copy())
eigenfaces = rpca.components_.reshape((n_components, h, w))
plot_gallery(eigenfaces,Category[0], h, w)
Extracting the top 244 eigenfaces from 624 faces CPU times: total: 57 s Wall time: 3.61 s
# Range of pictures in dataset:
# 0 - 234 = normal
# 234 - 624 = pneumonia
idx_to_reconstruct = 0
# change number to change picture
X_idx = pixels[idx_to_reconstruct]
low_dimensional_representation, reconstructed_image = reconstruct_image(rpca, X_idx.reshape(1, -1))
# original image
plt.subplot(1,2,1)
plt.imshow(X_idx.reshape((h, w)), cmap=plt.cm.gray)
plt.title('Original - Normal')
plt.grid(False)
# reconstructed image
plt.subplot(1,2,2)
plt.imshow(reconstructed_image.reshape((h, w)), cmap=plt.cm.gray)
plt.title('Reconstructed from Randomized PCA - Normal')
plt.grid(False)
The above picture on the left is an X-ray chest scan of a healthy patient. On the right is a lower-dimensional replication of the original image.
# Range of pictures in dataset:
# 0 - 234 = normal
# 234 - 624 = pneumonia
idx_to_reconstruct = 234
# change number to change picture
X_idx = pixels[idx_to_reconstruct]
low_dimensional_representation, reconstructed_image = reconstruct_image(rpca, X_idx.reshape(1, -1))
# original image
plt.subplot(1,2,1)
plt.imshow(X_idx.reshape((h, w)), cmap=plt.cm.gray)
plt.title('Original - Pneumonia')
plt.grid(False)
# reconstructed image
plt.subplot(1,2,2)
plt.imshow(reconstructed_image.reshape((h, w)), cmap=plt.cm.gray)
plt.title('Reconstructed from Randomized PCA - Pneumonia')
plt.grid(False)
The above picture on the left is an X-ray chest scan of a patient infected with pneumonia. On the right is a lower-dimensional replication of the original image.
Focusing on the PCA model, let's see how useful it is for image classification.
The following graph is interactive: choose a particular image in the dataset and see which images are closest to it.
from sklearn.metrics.pairwise import pairwise_distances
from ipywidgets import widgets
from ipywidgets import fixed
import copy
# find the pairwise distance between all the different image features
X_pca_features = pca.transform(copy.deepcopy(pixels))
dist_matrix_pca = pairwise_distances(copy.deepcopy(X_pca_features), metric="seuclidean")
def closest_image(dmat_pca, idx1):
    distances = copy.deepcopy(dmat_pca[idx1, :])
    distances[idx1] = np.infty
    idx2 = np.argmin(distances)
    distances[idx2] = np.infty
    idx3 = np.argmin(distances)
    plt.figure(figsize=(10, 16))
    plt.subplot(1, 3, 1)
    plt.imshow(pixels[idx1].reshape((h, w)), cmap=plt.cm.gray)
    plt.title("Original-PCA")
    plt.grid(False)
    plt.subplot(1, 3, 2)
    plt.imshow(pixels[idx2].reshape((h, w)), cmap=plt.cm.gray)
    plt.title("Closest-PCA")
    plt.grid(False)
    plt.subplot(1, 3, 3)
    plt.imshow(pixels[idx3].reshape((h, w)), cmap=plt.cm.gray)
    plt.title("Next Closest-PCA")
    plt.grid(False)
widgets.interact(closest_image,idx1=(0,n_samples-1,1), dmat_pca=fixed(dist_matrix_pca), manual=True)
As we slide through the images, most of the results are inaccurate.
Let's do the same, but with the RPCA model.
from sklearn.metrics.pairwise import pairwise_distances
from ipywidgets import fixed
# find the pairwise distance between all the different image features
X_rpca_features = rpca.transform(copy.deepcopy(pixels))
dist_matrix_rpca = pairwise_distances(copy.deepcopy(X_rpca_features), metric="seuclidean")
def closest_image(dmat_rpca, idx1):
    distances = copy.deepcopy(dmat_rpca[idx1, :])
    distances[idx1] = np.infty
    idx2 = np.argmin(distances)
    distances[idx2] = np.infty
    idx3 = np.argmin(distances)
    plt.figure(figsize=(10, 16))
    plt.subplot(1, 3, 1)
    plt.imshow(pixels[idx1].reshape((h, w)), cmap=plt.cm.gray)
    plt.title("Original-RPCA")
    plt.grid(False)
    plt.subplot(1, 3, 2)
    plt.imshow(pixels[idx2].reshape((h, w)), cmap=plt.cm.gray)
    plt.title("Closest-RPCA")
    plt.grid(False)
    plt.subplot(1, 3, 3)
    plt.imshow(pixels[idx3].reshape((h, w)), cmap=plt.cm.gray)
    plt.title("Next Closest-RPCA")
    plt.grid(False)
widgets.interact(closest_image,idx1=(0,n_samples-1,1), dmat_rpca=fixed(dist_matrix_rpca), manual=True)
We can make side-by-side comparisons of the images and their PCA and RPCA counterparts, but we want to know how accurate our predictions are across the entire dataset of images. We use the k-nearest neighbors (KNN) algorithm to calculate the accuracy of both models.
# checking accuracy of PCA with KNN
from sklearn.preprocessing import OneHotEncoder,LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
knn_pca = KNeighborsClassifier(n_neighbors=1)
knn_rpca = KNeighborsClassifier(n_neighbors=1)
X_pca_features = pca.transform(copy.deepcopy(pixels))
X_rpca_features = rpca.transform(copy.deepcopy(pixels))
class_int=np.asarray(LabelEncoder().fit_transform(Name))
pca_train, pca_test, rpca_train, rpca_test, y_train, y_test = train_test_split(X_pca_features,X_rpca_features, class_int, test_size=0.2, train_size=0.8)
knn_pca.fit(pca_train,y_train)
acc_pca = accuracy_score(knn_pca.predict(pca_test),y_test)
knn_rpca.fit(rpca_train,y_train)
acc_rpca = accuracy_score(knn_rpca.predict(rpca_test),y_test)
print(f"PCA accuracy:{100*acc_pca:.2f}% \nRPCA Accuracy:{100*acc_rpca:.2f}%")
PCA accuracy:76.00% RPCA Accuracy:76.00%
Now that we have created PCA and RPCA models and used them on our dataset, we have come to the conclusion that there is no difference in quality between their performance. With a desired variance of 0.95, both models need 244 components, and when we make predictions on our dataset, both produce an accuracy of 76.00%.
For further comparison, refer to the interactive graph below. You may choose any picture in the dataset, and the graph will display the original image alongside the results of the PCA and RPCA models.
def plt_reconstruct(idx_to_reconstruct):
    idx_to_reconstruct = int(np.round(idx_to_reconstruct))  # ensure an integer index
    x_flat = pixels[idx_to_reconstruct].reshape(1, -1)
    reconstructed_image = pca.inverse_transform(pca.transform(x_flat.copy()))
    reconstructed_image_rpca = rpca.inverse_transform(rpca.transform(x_flat.copy()))
    plt.figure(figsize=(15, 7))
    plt.subplot(1, 3, 1)  # original
    plt.imshow(x_flat.reshape((h, w)), cmap=plt.cm.gray)
    plt.title("Original" + f", {x_flat.shape[1]} elements")
    plt.grid(False)
    plt.subplot(1, 3, 2)  # pca
    plt.imshow(reconstructed_image.reshape((h, w)), cmap=plt.cm.gray)
    plt.title(f"PCA, {n_components} elements")
    plt.grid(False)
    plt.subplot(1, 3, 3)  # randomized pca
    plt.imshow(reconstructed_image_rpca.reshape((h, w)), cmap=plt.cm.gray)
    plt.title(f"Randomized PCA, {n_components} elements")
    plt.grid(False)
widgets.interact(plt_reconstruct,idx_to_reconstruct=(0,n_samples - 1,1),__manual=True)
DAISY is an image feature extraction technique that computes dense image descriptors very efficiently, reducing computational time and power. Let's use it on an image from our dataset.
We will be changing the step, radius, and histogram parameters to see how the DAISY results vary.
from skimage.feature import daisy
from skimage.io import imshow
# choose a random image and resize it
idx_to_reconstruct = int(np.random.rand(1)*len(pixels))
img = pixels[idx_to_reconstruct].reshape((h,w))
# extract the image features
features, img_desc = daisy(img, step = 50,radius=20, rings=2, histograms=8, orientations=8, visualize=True)
imshow(img_desc)
plt.grid(False)
Now that we have our features extracted, let's apply the DAISY technique to an image.
# function to apply DAISY to image
def apply_daisy(row, shape):
    feat = daisy(row.reshape(shape), step=20, radius=20, rings=2, histograms=8, orientations=4, visualize=False)
    return feat.reshape((-1))
# function to compare original image to closest image
def daisy_result(pixels, shape, idx1):
    # apply DAISY
    daisy_features = np.apply_along_axis(apply_daisy, 1, pixels, shape)
    # find the pairwise distance between all the different image features
    dist_matrix = pairwise_distances(daisy_features)
    # find closest image to current image
    distances = copy.deepcopy(dist_matrix[idx1, :])
    distances[idx1] = np.infty  # don't pick the same image!
    idx2 = np.argmin(distances)
    # plot original image
    plt.figure(figsize=(7, 10))
    plt.subplot(1, 2, 1)
    imshow(pixels[idx1].reshape((h, w)))
    plt.title("Original Image")
    plt.grid(False)
    # plot closest image
    plt.subplot(1, 2, 2)
    imshow(pixels[idx2].reshape((h, w)))
    plt.title("Closest Image")
    plt.grid(False)
daisy_result(pixels, (h,w), idx_to_reconstruct)
With our current parameters, the closest image looks very similar to the original image!
Let's try changing the parameters for our feature extraction.
This time we'll decrease the step to get more descriptors, increase the radius to expand the area of each descriptor, and increase the number of calculated histograms to collect more image data.
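The effect of `step` on descriptor density can be seen directly. This is a minimal sketch on a random stand-in image (assuming `scikit-image` is available, as used above): DAISY samples key points on a grid, so a smaller step yields a finer grid of descriptors.

```python
import numpy as np
from skimage.feature import daisy

rng = np.random.default_rng(0)
img = rng.random((224, 224))  # stand-in for one grayscale X-ray

coarse = daisy(img, step=50, radius=20, rings=2, histograms=8, orientations=8)
fine = daisy(img, step=20, radius=20, rings=2, histograms=8, orientations=8)

# Output grids are (rows, cols, descriptor_length): a smaller step gives more
# key points, while descriptor_length = (rings*histograms + 1) * orientations.
print(coarse.shape, fine.shape)
assert fine.shape[0] > coarse.shape[0]
assert coarse.shape[2] == fine.shape[2] == (2 * 8 + 1) * 8
```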
# choose random image and extract features
idx_to_reconstruct = int(np.random.rand(1)*len(pixels))
img = pixels[idx_to_reconstruct].reshape((h,w))
# extract the image features
features, img_desc = daisy(img, step = 20,radius=30, rings=2, histograms=12, orientations=8, visualize=True)
imshow(img_desc)
plt.grid(False)
daisy_result(pixels, (h,w), idx_to_reconstruct)
The closest match came from the wrong class, but the images are very similar in body shape, and both patients are infected with pneumonia. Let's change the parameters once more.
This time we increased the step to reduce the number of descriptors, increased the radius to expand the area of each descriptor, and decreased the number of calculated histograms to 6. Let's see what the results show.
# choose random image and extract features
idx_to_reconstruct = int(np.random.rand(1)*len(pixels))
img = pixels[idx_to_reconstruct].reshape((h,w))
# extract the image features
features, img_desc = daisy(img, step = 100, radius = 50 , rings=2, histograms = 6, orientations=8, visualize=True)
imshow(img_desc)
plt.grid(False)
daisy_result(pixels, (h,w), idx_to_reconstruct)
Our prediction was wrong once more, but again the images are similar.
We've been comparing images with others that look similar, but how accurate is DAISY when it comes to making predictions?
# calculating DAISY accuracy using KNN
knn_dsy = KNeighborsClassifier(n_neighbors=1)
daisy_features = np.apply_along_axis(apply_daisy, 1, pixels, (h,w))
dsy_train, dsy_test, y_train, y_test = train_test_split(daisy_features, class_int, test_size=0.2, train_size=0.8)
knn_dsy.fit(dsy_train,y_train)
acc_dsy = accuracy_score(knn_dsy.predict(dsy_test),y_test)
print(f"Daisy Accuracy:{100*acc_dsy:.2f}%")
Daisy Accuracy:82.40%
DAISY is very promising when it comes to extracting features and making predictions with our dataset. It is slightly more accurate in comparison to PCA and RPCA, although it should be noted that it takes longer to run.
Gabor filters are another technique used in image processing to extract features.
from skimage.filters import gabor_kernel
from scipy import ndimage as ndi
from scipy import stats
# prepare filter bank kernels
kernels = []
for theta in range(8):  # orientations
    theta = theta / 8. * np.pi
    for sigma in (1, 3, 5):  # standard deviations
        for frequency in (0.05, 0.15, 0.25, 0.35):  # frequencies
            kernel = np.real(gabor_kernel(frequency, theta=theta, sigma_x=sigma, sigma_y=sigma))
            kernels.append(kernel)

def compute_gabor(row, kernels, shape):
    feats = np.zeros((len(kernels), 4), dtype=np.double)
    for k, kernel in enumerate(kernels):
        filtered = ndi.convolve(row.reshape(shape), kernel, mode='wrap')
        # keep the mean, variance, skewness, and kurtosis of each filter response
        _, _, feats[k, 0], feats[k, 1], feats[k, 2], feats[k, 3] = stats.describe(filtered.reshape(-1))
    return feats.reshape(-1)
# choose random image and extract features
idx_to_reconstruct = int(np.random.rand(1)*len(pixels))
# takes some time to run entire dataset (using lots of orientations, std, and frequency)
gabor_stats = np.apply_along_axis(compute_gabor, 1, pixels, kernels, (h,w))
We've applied the Gabor filter technique to our dataset; now let's see how accurate our predictions are when we use it.
# Calculate Gabor accuracy using KNN
knn_gab = KNeighborsClassifier(n_neighbors=1)
gab_train, gab_test, y_train, y_test = train_test_split(gabor_stats,class_int,test_size = 0.2, train_size = 0.8)
knn_gab.fit(np.nan_to_num(gab_train), y_train)
acc_gab = accuracy_score(knn_gab.predict(np.nan_to_num(gab_test)),y_test)
print(f"Gabor accuracy: {100*acc_gab:.2f}%")
Gabor accuracy: 76.80%
def find_if_array_contains_nans(array):
    return np.isnan(array).any()
print(find_if_array_contains_nans(gab_train))
print(find_if_array_contains_nans(y_train))
print(find_if_array_contains_nans(gab_test))
True False True
Using Gabor filters, we got an accuracy of 76.80%. This technique is less promising for extracting features from our image dataset, in part due to the NaN values in gab_train.
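For reference, `np.nan_to_num` (used above before fitting the KNN) replaces NaN with 0 and ±inf with large finite numbers, which keeps the classifier from crashing but silently distorts distances wherever a filter response was degenerate. A minimal sketch:

```python
import numpy as np

feats = np.array([[1.0, np.nan, 3.0],
                  [np.inf, 2.0, -np.inf]])

cleaned = np.nan_to_num(feats)

print(cleaned)
assert not np.isnan(cleaned).any()  # NaN -> 0.0
assert np.isfinite(cleaned).all()   # +/-inf -> large finite values
assert cleaned[0, 1] == 0.0
```

Dropping or imputing the affected kernel statistics may be a better fix than zero-filling.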
from skimage.feature import match_descriptors
def apply_daisy(row, shape):
    feat = daisy(row.reshape(shape), step=5, radius=5, rings=2, histograms=8, orientations=4, visualize=False)
    s = feat.shape
    return feat.reshape((s[0] * s[1], s[2]))
d1 = apply_daisy(pixels[10],(h,w))
d2 = apply_daisy(pixels[11],(h,w))
d3 = apply_daisy(pixels[0],(h,w))
print(d1.shape, d2.shape, d3.shape)
print('Classes:',class_int[0],class_int[1])
# return list of the key points indices that matched closely enough
matches = match_descriptors(d1, d2, cross_check=True, max_ratio=0.8)
print(f"Number of matches, same class: {matches.shape[0]}, Percentage: {100*matches.shape[0]/len(d1):0.2f}%")
# return list of the key points indices that matched closely enough
matches = match_descriptors(d1, d3, cross_check=True, max_ratio=0.8)
print(f"Number of matches, diff classes: {matches.shape[0]}, Percentage: {100*matches.shape[0]/len(d1):0.2f}%")
(1849, 68) (1849, 68) (1849, 68) Classes: 1 1 Number of matches, same class: 108, Percentage: 5.84% Number of matches, diff classes: 41, Percentage: 2.22%
We've used PCA, RPCA, DAISY, and Gabor filters to make predictions on our dataset. Let's see which technique produces the highest accuracy.
# print accuracy of all methods
print(f"PCA Accuracy: {100*acc_pca:.2f}%")
print(f"RPCA Accuracy: {100*acc_rpca:.2f}%")
print(f"Daisy Accuracy: {100*acc_dsy:.2f}%")
print(f"Gabor Accuracy: {100*acc_gab:.2f}%")
PCA Accuracy: 76.00% RPCA Accuracy: 76.00% Daisy Accuracy: 82.40% Gabor Accuracy: 76.80%
Of all the techniques, DAISY has the highest accuracy at 82.40%. Therefore, we would choose DAISY out of all the techniques we have implemented.
For this dataset and its purpose, it is very important for our model to have the highest accuracy possible, even though that accuracy comes at the cost of longer run times. Our highest accuracy was 82.40%, which falls short of the 98% target accuracy we wanted. We believe that this model is not ready to be used in the medical industry.
A potential factor in the lower-than-expected accuracy is that we reduced the size of the images so that our code could run, which may have affected the accuracy of the models we created. Since data quality is crucial for these models to produce quality results, image size is important to consider. Another possible factor is the NaN values generated by the Gabor filter technique.
Code used in this notebook was adapted from Professor Larson's notebooks for this class.